Risk scores, label bias, and everything but the kitchen sink

In designing risk assessment algorithms, many scholars promote a “kitchen sink” approach, reasoning that more information yields more accurate predictions. We show, however, that this rationale often fails when algorithms are trained to predict a proxy of the true outcome, for instance, predicting arrest as a proxy for criminal behavior. With this “label bias,” one should exclude a feature if its correlation with the proxy and its correlation with the true outcome have opposite signs, conditional on the other model features. This criterion is often satisfied when a feature is weakly correlated with the true outcome and when that feature and the true outcome are both direct causes of the proxy. For example, criminal behavior and geography may be weakly correlated and, due to patterns of police deployment, both direct causes of one’s arrest record. In that case, excluding geography in criminal risk assessment will weaken an algorithm’s performance in predicting arrest but will improve its capacity to predict actual crime.


A Proof of Theorem 1
To start, note that for any square-integrable random variable $\hat{Y}$,
$$\mathbb{E}\big[(Y - \hat{Y})^2\big] = \operatorname{Var}(Y) + \operatorname{Var}(\hat{Y}) - 2\operatorname{Cov}(Y, \hat{Y}) + \big(\mathbb{E}[Y] - \mathbb{E}[\hat{Y}]\big)^2.$$
Since $Y'$ is square-integrable by assumption, so are $\hat{Y}_{X,Z}$ and $\hat{Y}_X$ (by the law of total variance), and so
$$\begin{aligned}
\mathbb{E}\big[(Y - \hat{Y}_X)^2\big] - \mathbb{E}\big[(Y - \hat{Y}_{X,Z})^2\big]
&= \operatorname{Var}(\hat{Y}_X) - \operatorname{Var}(\hat{Y}_{X,Z}) + 2\operatorname{Cov}(Y, \hat{Y}_{X,Z}) - 2\operatorname{Cov}(Y, \hat{Y}_X) \\
&\qquad + \big(\mathbb{E}[Y] - \mathbb{E}[\hat{Y}_X]\big)^2 - \big(\mathbb{E}[Y] - \mathbb{E}[\hat{Y}_{X,Z}]\big)^2 \\
&= \operatorname{Var}(\hat{Y}_X) - \operatorname{Var}(\hat{Y}_{X,Z}) + 2\operatorname{Cov}(Y, \hat{Y}_{X,Z}) - 2\operatorname{Cov}(Y, \hat{Y}_X) \\
&= 2\operatorname{Cov}(Y, \hat{Y}_{X,Z}) - 2\operatorname{Cov}(Y, \hat{Y}_X) - \mathbb{E}\big[\operatorname{Var}(\hat{Y}_{X,Z} \mid X)\big],
\end{aligned}\tag{1}$$
where the penultimate line follows from the fact that $\mathbb{E}[\hat{Y}_{X,Z}] = \mathbb{E}[\hat{Y}_X] = \mathbb{E}[Y']$, and the last line follows from the law of total variance, since $\mathbb{E}[\hat{Y}_{X,Z} \mid X] = \hat{Y}_X$ implies $\operatorname{Var}(\hat{Y}_{X,Z}) = \operatorname{Var}(\hat{Y}_X) + \mathbb{E}[\operatorname{Var}(\hat{Y}_{X,Z} \mid X)]$. Now,
$$\begin{aligned}
\mathbb{E}\big[\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X)\big]
&= \mathbb{E}\big[\hat{Y}_{X,Z}\,Y\big] - \mathbb{E}\big[\hat{Y}_X\,\mathbb{E}[Y \mid X]\big] \\
&= \mathbb{E}\big[\hat{Y}_{X,Z}\,Y\big] - \mathbb{E}\big[\hat{Y}_X\,Y\big] \\
&= \operatorname{Cov}(Y, \hat{Y}_{X,Z}) - \operatorname{Cov}(Y, \hat{Y}_X),
\end{aligned}\tag{2}$$
where we repeatedly applied the law of iterated expectations (noting that $\mathbb{E}[\hat{Y}_{X,Z} \mid X] = \hat{Y}_X$), used the fact that $\hat{Y}_X$ is measurable with respect to $X$ in the second equality, and used $\mathbb{E}[\hat{Y}_{X,Z}] = \mathbb{E}[\hat{Y}_X]$ in the last. Eqs. (1) and (2) together establish Eq. (1) in the theorem statement:
$$\mathbb{E}\big[(Y - \hat{Y}_X)^2\big] - \mathbb{E}\big[(Y - \hat{Y}_{X,Z})^2\big] = 2\,\mathbb{E}\big[\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X)\big] - \mathbb{E}\big[\operatorname{Var}(\hat{Y}_{X,Z} \mid X)\big].$$

Eq. (2) in the theorem statement now follows immediately, since when $\mathbb{E}[\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X)] \le 0$,
$$\mathbb{E}\big[(Y - \hat{Y}_{X,Z})^2\big] - \mathbb{E}\big[(Y - \hat{Y}_X)^2\big] = \mathbb{E}\big[\operatorname{Var}(\hat{Y}_{X,Z} \mid X)\big] - 2\,\mathbb{E}\big[\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X)\big] \ge 0,$$
where the inequality is strict if $\hat{Y}_{X,Z} \neq \hat{Y}_X$, since then $\mathbb{E}[\operatorname{Var}(\hat{Y}_{X,Z} \mid X)] = \mathbb{E}\big[(\hat{Y}_{X,Z} - \hat{Y}_X)^2\big] > 0$, establishing the result. □
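
To make the decomposition concrete, here is a minimal Monte Carlo sketch in Python. The linear-Gaussian design and all coefficient values (`rho`, `a1`, `a2`, `b1`, `b2`) are our own illustrative assumptions rather than anything from the analysis above; the point is simply that the simulated gap in Eq. (1) matches its closed form, and that the gap turns negative when the coefficients on $Z$ in the true and proxy outcomes have opposite signs.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Illustrative linear-Gaussian design (our assumption, not from the paper):
#   Z = rho*X + noise,  Y = a1*X + a2*Z + noise (true outcome),
#   Y' = b1*X + b2*Z + noise (proxy outcome used for training).
rho, a1, a2, b1, b2 = 0.5, 1.0, 0.2, 1.0, -0.6

X = rng.standard_normal(n)
Z = rho * X + np.sqrt(1 - rho**2) * rng.standard_normal(n)
Y = a1 * X + a2 * Z + rng.standard_normal(n)
Yp = b1 * X + b2 * Z + rng.standard_normal(n)

# "Train" on the proxy: least squares approximates E[Y' | X, Z] and E[Y' | X].
XZ = np.column_stack([X, Z])
coef_xz, *_ = np.linalg.lstsq(XZ, Yp, rcond=None)
coef_x, *_ = np.linalg.lstsq(X[:, None], Yp, rcond=None)
Y_hat_XZ = XZ @ coef_xz
Y_hat_X = X[:, None] @ coef_x

# Left-hand side of Eq. (1) in the theorem statement, estimated by simulation:
lhs = np.mean((Y - Y_hat_X) ** 2) - np.mean((Y - Y_hat_XZ) ** 2)

# Right-hand side in closed form for this design:
#   Cov(Yhat_XZ, Y | X) = a2*b2*(1 - rho^2),  Var(Yhat_XZ | X) = b2^2*(1 - rho^2).
rhs = (2 * a2 * b2 - b2**2) * (1 - rho**2)

print(f"simulated gap: {lhs:.4f}   closed form: {rhs:.4f}")
# a2 and b2 have opposite signs here, so the gap is negative: the model
# without Z is the better predictor of the true outcome Y.
```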

B Proof of Corollary 1
By Theorem 1, it is sufficient to show that $\mathbb{E}\big[\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X)\big] \le 0$. We start by noting that, by the linearity assumption of the corollary, we can write
$$\hat{Y}_{X,Z} = \mathbb{E}[Y' \mid X, Z] = g(X) + bZ \quad\text{and}\quad \mathbb{E}[Y \mid X, Z] = f(X) + aZ;$$
then by the assumption of the theorem that the conditional associations of $Z$ with the proxy and the true outcome have opposite signs,
$$ab \le 0. \tag{3}$$
Now, by repeatedly applying the law of iterated expectations, we have
$$\begin{aligned}
\operatorname{Cov}(Z, Y \mid X) &= \mathbb{E}[ZY \mid X] - \mathbb{E}[Z \mid X]\,\mathbb{E}[Y \mid X]\\
&= \mathbb{E}\big[Z\,\mathbb{E}[Y \mid X, Z] \mid X\big] - \mathbb{E}[Z \mid X]\,\mathbb{E}\big[\mathbb{E}[Y \mid X, Z] \mid X\big]\\
&= \mathbb{E}\big[Z\,\big(f(X) + aZ\big) \mid X\big] - \mathbb{E}[Z \mid X]\,\big(f(X) + a\,\mathbb{E}[Z \mid X]\big)\\
&= a\operatorname{Var}(Z \mid X).
\end{aligned}$$
Similarly, we have
$$\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X) = \operatorname{Cov}\big(g(X) + bZ,\, Y \mid X\big) = b\operatorname{Cov}(Z, Y \mid X).$$
Putting the above together, we get
$$\mathbb{E}\big[\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X)\big] = ab\,\mathbb{E}\big[\operatorname{Var}(Z \mid X)\big].$$
Finally, by Eq. (3), $\mathbb{E}\big[\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X)\big] \le 0$, establishing the result. □
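
The key step above, that $\mathbb{E}[\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X)] = ab\,\mathbb{E}[\operatorname{Var}(Z \mid X)]$, can also be checked numerically: by Eq. (2) of Appendix A, the left-hand side equals a difference in unconditional covariances, which is easy to estimate. The sketch below reuses the same illustrative Gaussian design as before (again an assumption of ours, not part of the corollary), in which $a = a_2$, $b = b_2$, and $\operatorname{Var}(Z \mid X) = 1 - \rho^2$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000

# Same illustrative Gaussian design as above: a2 and b2 are the coefficients
# on Z in E[Y | X, Z] and E[Y' | X, Z], respectively.
rho, a1, a2, b1, b2 = 0.5, 1.0, 0.2, 1.0, -0.6

X = rng.standard_normal(n)
Z = rho * X + np.sqrt(1 - rho**2) * rng.standard_normal(n)
Y = a1 * X + a2 * Z + rng.standard_normal(n)

Y_hat_XZ = b1 * X + b2 * Z       # E[Y' | X, Z], linear by construction
Y_hat_X = (b1 + b2 * rho) * X    # E[Y' | X], using E[Z | X] = rho*X

# By Eq. (2) of Appendix A, E[Cov(Yhat_XZ, Y | X)] equals the difference in
# unconditional covariances, which we estimate directly from the simulation:
est = np.cov(Y, Y_hat_XZ)[0, 1] - np.cov(Y, Y_hat_X)[0, 1]
exact = a2 * b2 * (1 - rho**2)   # a * b * E[Var(Z | X)]

print(f"estimated: {est:.4f}   a*b*E[Var(Z|X)]: {exact:.4f}")  # both negative
```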

C Kitchen-Sink Models and Independent Noise
When the proxy label $Y'$ and the true label $Y$ differ only by additive, independent noise, it is advantageous to use all available information when constructing risk scores. The following proposition formalizes this statement.
Proposition 1 In the setting of Theorem 1, suppose $Y' = Y + \varepsilon$, where the noise term $\varepsilon$ is independent of $(X, Y, Z)$. Then $\mathbb{E}\big[(Y - \hat{Y}_{X,Z})^2\big] \le \mathbb{E}\big[(Y - \hat{Y}_X)^2\big]$.

Proof. First note that
$$\hat{Y}_{X,Z} = \mathbb{E}[Y + \varepsilon \mid X, Z] = \mathbb{E}[Y \mid X, Z] + \mathbb{E}[\varepsilon],$$
where the second equality uses the independence assumption. Similarly, we have
$$\begin{aligned}
\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X)
&= \operatorname{Cov}\big(\mathbb{E}[Y \mid X, Z],\, Y \mid X\big)\\
&= \mathbb{E}\big[\mathbb{E}[Y \mid X, Z]\,Y \mid X\big] - \mathbb{E}\big[\mathbb{E}[Y \mid X, Z] \mid X\big]\,\mathbb{E}[Y \mid X]\\
&= \mathbb{E}\big[\big(\mathbb{E}[Y \mid X, Z]\big)^2 \mid X\big] - \big(\mathbb{E}[Y \mid X]\big)^2\\
&= \operatorname{Var}\big(\hat{Y}_{X,Z} \mid X\big),
\end{aligned}$$
where the third equality follows from the fact that $\mathbb{E}\big[\mathbb{E}[Y \mid X, Z]\,Y \mid X, Z\big] = \big(\mathbb{E}[Y \mid X, Z]\big)^2$, and the last equality follows from the law of iterated expectations, noting that $\hat{Y}_{X,Z}$ and $\mathbb{E}[Y \mid X, Z]$ differ only by the constant $\mathbb{E}[\varepsilon]$. Finally, since
$$2\,\mathbb{E}\big[\operatorname{Cov}(\hat{Y}_{X,Z}, Y \mid X)\big] - \mathbb{E}\big[\operatorname{Var}(\hat{Y}_{X,Z} \mid X)\big] = \mathbb{E}\big[\operatorname{Var}(\hat{Y}_{X,Z} \mid X)\big] \ge 0,$$
Eq. (1) of Theorem 1 implies $\mathbb{E}\big[(Y - \hat{Y}_{X,Z})^2\big] \le \mathbb{E}\big[(Y - \hat{Y}_X)^2\big]$, establishing the result. □
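
A short simulation illustrates the proposition; the design below (a linear-Gaussian model with proxy equal to truth plus standard normal noise) is again our own illustrative choice, not part of the proof. Training on the proxy, the kitchen-sink model now beats the restricted model at predicting the true label, with a gap matching $\mathbb{E}[\operatorname{Var}(\hat{Y}_{X,Z} \mid X)]$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Illustrative design (our assumption): Y' = Y + eps, eps independent of all else.
rho, a1, a2 = 0.5, 1.0, 0.2

X = rng.standard_normal(n)
Z = rho * X + np.sqrt(1 - rho**2) * rng.standard_normal(n)
Y = a1 * X + a2 * Z + rng.standard_normal(n)
Yp = Y + rng.standard_normal(n)  # proxy = truth + independent noise

# Train on the proxy with and without Z:
XZ = np.column_stack([X, Z])
coef_xz, *_ = np.linalg.lstsq(XZ, Yp, rcond=None)
coef_x, *_ = np.linalg.lstsq(X[:, None], Yp, rcond=None)
Y_hat_XZ = XZ @ coef_xz
Y_hat_X = X[:, None] @ coef_x

# Evaluate against the true label Y:
mse_xz = np.mean((Y - Y_hat_XZ) ** 2)
mse_x = np.mean((Y - Y_hat_X) ** 2)
gap_exact = a2**2 * (1 - rho**2)  # E[Var(Yhat_XZ | X)], the predicted gap

print(f"MSE with Z: {mse_xz:.4f}   MSE without Z: {mse_x:.4f}")
print(f"gap: {mse_x - mse_xz:.4f}   predicted: {gap_exact:.4f}")
```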

D A Stylized Model of Arrest and Behavior
We formally describe and analyze the SEM depicted in Figure 1. Our model has three independent exogenous variables ($U_Z$, $U_{A_0}$, and $U_{A_1}$), along with two further exogenous variables ($U_{B_0}$ and $U_{B_1}$) that are independent of the first three, with $\operatorname{Cov}(U_{B_0}, U_{B_1}) = \delta \ge 0$. Now, for non-negative constants $\alpha$, $\beta$, and $\gamma$, the key variables in the model are generated by the following linear structural equations:
$$Z = U_Z, \qquad
\begin{aligned}
B_0 &= \beta Z + U_{B_0}, & B_1 &= \beta Z + U_{B_1},\\
A_0 &= \alpha Z + \gamma B_0 + U_{A_0}, & A_1 &= \alpha Z + \gamma B_1 + U_{A_1}.
\end{aligned}\tag{4}$$
We set the variances of the exogenous variables ($\sigma^2_Z$, $\sigma^2_A$, and $\sigma^2_B$) in a manner that ensures that the remaining variables ($Z$, $B_0$, $B_1$, $A_0$, and $A_1$) are standardized, meaning they have mean 0 and variance 1; we show how to do this below. We can thus interpret their values as representing the extent to which individuals differ from the population averages. In the case of neighborhood ($Z$), we can think of its value as denoting the level of police enforcement in an area.
To start, we set $\sigma^2_Z = 1$, which ensures $\operatorname{Var}(Z) = 1$. Now, since $Z \perp\!\!\!\perp U_{B_0}$, we have that
$$\operatorname{Var}(B_0) = \beta^2 \operatorname{Var}(Z) + \sigma^2_B = \beta^2 + \sigma^2_B.$$
Consequently, setting $\sigma^2_B = 1 - \beta^2$ ensures that $\operatorname{Var}(B_0) = 1$ (and, similarly, that $\operatorname{Var}(B_1) = 1$). Finally, as above,
$$\operatorname{Var}(A_0) = \alpha^2 + \gamma^2 + \sigma^2_A + 2\alpha\gamma \operatorname{Cov}(Z, B_0).$$
One especially nice aspect of linear graphical models is that the covariance between any two variables can be immediately computed from the edge weights via the Wright rules (35, 39). Specifically, when the nodes are standardized to have variance 1, the covariance between any two variables in the graph is the sum, over all d-connected paths between the variables, of the product of the edge weights along the path. A path is d-connected if it does not pass through any colliders (i.e., nodes with head-to-head arrows along the path). To compute $\operatorname{Cov}(Z, B_0)$, observe that the only d-connected path between $Z$ and $B_0$ is the direct path from $Z$ to $B_0$, having edge weight $\beta$. As a result, $\operatorname{Cov}(Z, B_0) = \beta$, meaning that setting $\sigma^2_A = 1 - \alpha^2 - \gamma^2 - 2\alpha\beta\gamma$ ensures that $A_0$ (and, analogously, $A_1$) have unit variance. Recapping, we have
$$\sigma^2_Z = 1, \qquad \sigma^2_B = 1 - \beta^2, \qquad \sigma^2_A = 1 - \alpha^2 - \gamma^2 - 2\alpha\beta\gamma. \tag{5}$$
Our model is thus described by the four non-negative parameters $\alpha$, $\beta$, $\gamma$, and $\delta$, depicted as edge weights in Figure 1, with the constraint that the quantities in Eq. (5) are non-negative. Those constraints in turn imply that the parameters are each less than or equal to 1.
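
The standardization in Eq. (5) is easy to verify by simulation. The following sketch (with parameter values of our own choosing, for illustration only) draws from the structural equations, constructing the correlated noise pair $(U_{B_0}, U_{B_1})$ from a shared component, and checks the unit variances along with two Wright-rule covariances.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

alpha, beta, gamma, delta = 0.3, 0.4, 0.5, 0.2  # illustrative edge weights

# Exogenous variances from Eq. (5):
s2_Z = 1.0
s2_B = 1 - beta**2
s2_A = 1 - alpha**2 - gamma**2 - 2 * alpha * beta * gamma
assert s2_A >= 0 and s2_B >= delta, "parameters violate the model constraints"

# U_B0 and U_B1 each have variance s2_B, with Cov(U_B0, U_B1) = delta,
# built from a shared standard normal component:
shared = rng.standard_normal(n)
U_B0 = np.sqrt(delta) * shared + np.sqrt(s2_B - delta) * rng.standard_normal(n)
U_B1 = np.sqrt(delta) * shared + np.sqrt(s2_B - delta) * rng.standard_normal(n)

# Structural equations:
Z = np.sqrt(s2_Z) * rng.standard_normal(n)
B0 = beta * Z + U_B0
B1 = beta * Z + U_B1
A0 = alpha * Z + gamma * B0 + np.sqrt(s2_A) * rng.standard_normal(n)
A1 = alpha * Z + gamma * B1 + np.sqrt(s2_A) * rng.standard_normal(n)

print("variances ~ 1:", [round(np.var(v), 3) for v in (Z, B0, B1, A0, A1)])
print("Cov(Z, A0):", round(np.cov(Z, A0)[0, 1], 3),
      " alpha + beta*gamma:", alpha + beta * gamma)
print("Cov(B0, B1):", round(np.cov(B0, B1)[0, 1], 3),
      " beta^2 + delta:", beta**2 + delta)
```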
Our theoretical results in Theorem 1 and Corollary 1 require understanding the conditional distributions of model features. For multivariate normal random variables, these conditional distributions can be computed analytically (40), allowing us to examine properties of our motivating SEM in more depth. Specifically, suppose $W$ is a $k$-dimensional multivariate normal random variable with mean $\mu$ and covariance $\Sigma$, which we partition into its first $q$ components and its remaining $k - q$ components: $W = [W_1, W_2]$. Further suppose we accordingly partition $\mu$ and $\Sigma$ into their components:
$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}.$$
Then the distribution of $W_1$ conditional on $W_2 = w_2$ is multivariate normal with mean
$$\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(w_2 - \mu_2)$$
and covariance
$$\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
As a result, the linearity assumption of Corollary 1 is satisfied for multivariate normal random variables. In particular, in our motivating example, the conditional distribution of $A_1$ given $A_0$ and $Z$ is normal, with mean
$$\begin{bmatrix} \sigma_{A_1 A_0} & \sigma_{A_1 Z} \end{bmatrix}
\begin{bmatrix} 1 & \sigma_{A_0 Z} \\ \sigma_{A_0 Z} & 1 \end{bmatrix}^{-1}
\begin{bmatrix} A_0 \\ Z \end{bmatrix},$$
where the $\sigma$ notation denotes the covariance of the indexed random variables. Further, the conditional distribution of $(A_1, Z)$ given $A_0$ is likewise multivariate normal, with covariance matrix
$$\begin{bmatrix} 1 - \sigma_{A_1 A_0}^2 & \sigma_{A_1 Z} - \sigma_{A_1 A_0}\sigma_{Z A_0} \\ \sigma_{A_1 Z} - \sigma_{A_1 A_0}\sigma_{Z A_0} & 1 - \sigma_{Z A_0}^2 \end{bmatrix}, \tag{6}$$
and, analogously, we have that
$$\operatorname{Cov}\big((B_1, Z) \mid A_0\big) = \begin{bmatrix} 1 - \sigma_{B_1 A_0}^2 & \sigma_{B_1 Z} - \sigma_{B_1 A_0}\sigma_{Z A_0} \\ \sigma_{B_1 Z} - \sigma_{B_1 A_0}\sigma_{Z A_0} & 1 - \sigma_{Z A_0}^2 \end{bmatrix}. \tag{7}$$
As above, we can compute the covariances in Eqs. (6) and (7) via the Wright rules. For example, as seen in Figure 1, there are two d-connected paths between $Z$ and $A_0$: the direct connection with edge weight $\alpha$; and the path through $B_0$, with product of edge weights $\beta\gamma$. Consequently, $\operatorname{Cov}(Z, A_0) = \alpha + \beta\gamma$. This procedure allows us to compute all of the terms appearing on the right-hand side of Eqs. (6) and (7), yielding:
$$\sigma_{Z A_0} = \sigma_{Z A_1} = \alpha + \beta\gamma, \qquad
\sigma_{A_1 A_0} = (\alpha + \beta\gamma)^2 + \gamma^2\delta, \qquad
\sigma_{B_1 Z} = \beta, \qquad
\sigma_{B_1 A_0} = \beta(\alpha + \beta\gamma) + \gamma\delta.$$
Leveraging the above, we now show that $\operatorname{Cov}(A_1, Z \mid A_0) \ge 0$, meaning that neighborhood is positively correlated with future arrests, conditional on past arrests. To see this, first note that $\operatorname{Cov}(B_0, B_1) = \beta^2 + \delta$ is a correlation between standardized variables, and so $\beta^2 + \delta \le 1$. Now,
$$\begin{aligned}
\operatorname{Cov}(A_1, Z \mid A_0) &= \sigma_{A_1 Z} - \sigma_{A_1 A_0}\sigma_{Z A_0}\\
&= (\alpha + \beta\gamma)\big(1 - (\alpha + \beta\gamma)^2 - \gamma^2\delta\big)\\
&\ge (\alpha + \beta\gamma)\big(1 - (\alpha + \beta\gamma)^2 - \gamma^2(1 - \beta^2)\big)\\
&= (\alpha + \beta\gamma)\,\sigma^2_A\\
&\ge 0,
\end{aligned}$$
where the first inequality follows from the fact that $\beta^2 + \delta \le 1$, and the final inequality follows from the non-negativity of $\sigma^2_A$ in Eq. (5).
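
As a numerical check on the inequality just established, the sketch below computes $\operatorname{Cov}(A_1, Z \mid A_0)$ from the Wright-rule covariances over a grid of parameter values (the grid and its resolution are our choices) and confirms it is never negative on the valid region, i.e., where the quantities in Eq. (5) are non-negative and $\beta^2 + \delta \le 1$.

```python
import itertools
import numpy as np

def cond_cov_A1_Z(alpha, beta, gamma, delta):
    """Cov(A1, Z | A0) from the Wright-rule covariances in Eq. (6)."""
    s_ZA0 = alpha + beta * gamma
    s_A1Z = alpha + beta * gamma
    s_A1A0 = (alpha + beta * gamma) ** 2 + gamma**2 * delta
    return s_A1Z - s_A1A0 * s_ZA0  # Var(A0) = 1, so no further normalization

# Scan the region where Eq. (5) is non-negative and beta^2 + delta <= 1:
grid = np.linspace(0, 1, 21)
worst = np.inf
for a, b, g, d in itertools.product(grid, repeat=4):
    if 1 - a**2 - g**2 - 2 * a * b * g < 0 or b**2 + d > 1:
        continue
    worst = min(worst, cond_cov_A1_Z(a, b, g, d))

print("minimum Cov(A1, Z | A0) over valid grid:", round(worst, 6))  # >= 0
```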
Next we consider $\operatorname{Cov}(B_1, Z \mid A_0)$, and note that
$$\operatorname{Cov}(B_1, Z \mid A_0) = \sigma_{B_1 Z} - \sigma_{B_1 A_0}\sigma_{Z A_0} = \beta\big(1 - (\alpha + \beta\gamma)^2\big) - \gamma\delta(\alpha + \beta\gamma).$$
In particular, when $\beta = 0$, meaning that neighborhood does not impact behavior, then
$$\operatorname{Cov}(B_1, Z \mid A_0) = -\alpha\gamma\delta \le 0.$$
In other words, when neighborhood does not impact behavior (i.e., when $\beta = 0$), neighborhood is negatively correlated with future behavior conditional on past arrests. (And, by the above, neighborhood is always positively correlated with future arrests conditional on past arrests.) By Corollary 1, it is thus better in this case to base predictions of future behavior solely on past arrests, excluding neighborhood, as we see in Figure 2.
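
Finally, a simulation of the $\beta = 0$ case (with parameter values chosen by us for illustration) mirrors the comparison in Figure 2: we fit linear predictors of the proxy $A_1$ with and without $Z$, then score both against the true behavior $B_1$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000_000

# beta = 0: neighborhood affects arrests but not behavior (illustrative values).
alpha, beta, gamma, delta = 0.4, 0.0, 0.6, 0.5
s2_B = 1 - beta**2
s2_A = 1 - alpha**2 - gamma**2 - 2 * alpha * beta * gamma

# Correlated behavior noise, as in the earlier sketch:
shared = rng.standard_normal(n)
U_B0 = np.sqrt(delta) * shared + np.sqrt(s2_B - delta) * rng.standard_normal(n)
U_B1 = np.sqrt(delta) * shared + np.sqrt(s2_B - delta) * rng.standard_normal(n)

Z = rng.standard_normal(n)
B0, B1 = beta * Z + U_B0, beta * Z + U_B1
A0 = alpha * Z + gamma * B0 + np.sqrt(s2_A) * rng.standard_normal(n)
A1 = alpha * Z + gamma * B1 + np.sqrt(s2_A) * rng.standard_normal(n)

# Train on the proxy (future arrest A1), evaluate on the truth (behavior B1):
XZ = np.column_stack([A0, Z])
coef_xz, *_ = np.linalg.lstsq(XZ, A1, rcond=None)
coef_x, *_ = np.linalg.lstsq(A0[:, None], A1, rcond=None)

mse_with_Z = np.mean((B1 - XZ @ coef_xz) ** 2)
mse_without_Z = np.mean((B1 - A0[:, None] @ coef_x) ** 2)
print(f"MSE for B1, with Z: {mse_with_Z:.4f}   without Z: {mse_without_Z:.4f}")
# Excluding neighborhood yields the better predictor of true behavior,
# even though including it yields a better predictor of arrest.
```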